Online medical journal article layout analysis

Authors

  • Jie Zou
  • Daniel X. Le
  • George R. Thoma
Abstract

We describe a physical and logical layout analysis algorithm, which is applied to segment and label online medical journal articles (regular HTML and PDF-Converted-HTML files). For these articles, the geometric layout of the Web page is the most important cue for physical layout analysis. The key to physical layout analysis is then to render the HTML file in a Web browser, so that the visual information in zones (composed of one or a set of HTML DOM nodes), especially their relative position, can be utilized. The recursive X-Y cut algorithm is adopted to construct a hierarchical zone tree structure. In logical layout analysis, both geometric and linguistic features are used. The HTML documents are modeled by a Hidden Markov Model with 16 states, and the Viterbi algorithm is then used to find the optimal label sequence, concluding the logical layout analysis.
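The recursive X-Y cut mentioned in the abstract can be illustrated with a minimal sketch. It assumes each zone has already been reduced to a bounding box `(left, top, right, bottom)` in rendered-page pixels, for example read from a browser after rendering the HTML. The names `ZoneNode`, `xy_cut`, and the `min_gap` threshold are hypothetical choices made for illustration and are not taken from the paper; the authors' actual implementation may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed zone representation: (left, top, right, bottom) in page pixels.
Box = Tuple[float, float, float, float]


@dataclass
class ZoneNode:
    """One node of the hierarchical zone tree (hypothetical structure)."""
    boxes: List[Box]
    children: List["ZoneNode"] = field(default_factory=list)


def _split(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes whose projections on one axis are separated by a gap
    wider than min_gap. axis=0 projects onto x, axis=1 projects onto y."""
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest extent covered by the current group
    for b in ordered[1:]:
        if b[lo] - reach > min_gap:
            groups.append([b])  # whitespace gap found: start a new group
            reach = b[hi]
        else:
            groups[-1].append(b)
            reach = max(reach, b[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> ZoneNode:
    """Recursively cut the page along whitespace gaps, trying horizontal
    cuts (split along y) first and vertical cuts (split along x) second."""
    node = ZoneNode(boxes=list(boxes))
    if len(boxes) <= 1:
        return node
    for axis in (1, 0):
        groups = _split(node.boxes, axis, min_gap)
        if len(groups) > 1:
            node.children = [xy_cut(g, min_gap) for g in groups]
            return node
    return node  # no gap wide enough in either direction: leaf zone


# Hypothetical page: a full-width title above a two-column body.
page = [
    (0, 0, 100, 10),    # title
    (0, 20, 45, 60),    # left column, first paragraph
    (55, 20, 100, 60),  # right column
    (0, 62, 45, 90),    # left column, second paragraph
]
tree = xy_cut(page)
```

In this example the first cut separates the title from the body, and the second cut splits the body into a left and a right column. The two left-column paragraphs stay in one leaf because the gap between them is smaller than the assumed `min_gap`.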

Similar resources

Bibliographic data extraction from HTML medical journal articles

MEDLINE, a biomedical literature database compiled by the US National Library of Medicine, contains 15 million records from approximately 5000 selected journals, and is searched over 3 million times a day worldwide. With more journal articles being published online in hypertext markup language (HTML), the automatic extraction of bibliographic data from HTML articles is important for creating MED...

Automated Document Labeling

An increasing number of publishers are using the Internet and the World Wide Web to provide their subscribers with access to online journals. New techniques are needed to capture, classify, analyze, extract, modify, and reformat Web-based document information for computer storage, access, and processing. An R&D division of the National Library of Medicine (NLM) is developing an automated system...

Style-independent document labeling: design and performance evaluation

The Medical Article Records System or MARS has been developed at the U.S. National Library of Medicine (NLM) for automated data entry of bibliographical information from medical journals into MEDLINE®, the premier bibliographic citation database at NLM. Currently, a rule-based algorithm (called ZoneCzar) is used for labeling important bibliographical fields (title, author, affiliation, and abst...

Online analysis of local field potentials for seizure detection in freely moving rats

Objective(s): Seizure detection during online recording of electrophysiological parameters is very important in epileptic patients. In the present study, online analysis of field potential recordings was used for detecting spontaneous seizures in epileptic animals. Materials and Methods: Epilepsy was induced in rats by pilocarpine injecti...

Layout Definition of Online Magazines with Splitter Components

The capabilities of current mobile devices and the quality of their screens have reached a level where the online reading experience competes with print media. Commercially printed magazines and newspapers commonly apply different grid-based page designs. In the case of online magazines, variable conditions, e.g. screen resolution, user preferences and the actual content, require to provide ada...

Publication date: 2007